https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe
Our analysis primarily revolves around this dataset, with several supplementary datasets joined to it for further in-depth analysis. This dataset contains 208,000 different rat sightings in the City of New York from 2010 to the present day, reported by citizens to the City of New York and accessed from NYC Open Data. Each sighting records 38 different variables, notably geographic data (latitude, longitude, and borough) and the dates on which the complaint was opened and closed.
We join various auxiliary datasets (described below) to our rat sightings dataset in order to better examine how rat sightings correlate to other demographic and geographic factors.
https://data.cityofnewyork.us/Transportation/Subway-Entrances/drex-xx56
This dataset, also sourced from NYC Open Data, contains the names, line numbers, and geographic coordinates of 1,928 subway entrances in New York City to date.
https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi
This is a 2019 IRS-sourced dataset which contains tax return information for each of the 178 zip codes in NYC; namely, the number of individual tax returns and the total amounts requested on them by eligible citizens of each zip code.
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
This is an NYC Open Data dataset, most recently updated on December 10, 2022, containing 231,000 records, each corresponding to a health violation citation given to a restaurant in NYC by the City of New York’s Health Department. Its 27 variables include, most importantly, the location and zip code of each restaurant that was issued a citation.
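The joins mentioned above, which attach auxiliary data to the rat sightings, could be sketched roughly as follows: count sightings per zip code, then merge the counts onto a zip-level table with base R's `merge()`. The toy data frames and the column names `Incident.Zip`, `zipcode`, and `n_returns` are assumptions made for illustration and may not match the names in the actual datasets.

```r
# Toy stand-ins for the real tables; all values and column names are hypothetical.
rats_toy <- data.frame(
  Incident.Zip = c("10001", "10001", "10002", "11201"),
  stringsAsFactors = FALSE
)
tax_toy <- data.frame(
  zipcode = c("10001", "10002", "10003"),
  n_returns = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Count sightings per zip code.
zip_counts <- as.data.frame(table(rats_toy$Incident.Zip), stringsAsFactors = FALSE)
names(zip_counts) <- c("zipcode", "n_rats")

# Left join: keep every zip code in the tax table, even those without sightings.
joined <- merge(tax_toy, zip_counts, by = "zipcode", all.x = TRUE)

# Zip codes with no matching sightings get a count of zero rather than NA.
joined$n_rats[is.na(joined$n_rats)] <- 0

print(joined)
```

Keeping zip codes as character strings avoids losing leading zeros, and the left join (`all.x = TRUE`) makes it explicit that sightings in zip codes missing from the auxiliary table are dropped.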
Going into this project, our group had several questions we wanted to answer regarding the distribution of rats in the city. Namely:
In all, we hope to draw observations that extend beyond the mere topic of rats, using rat sightings as a proxy for deeper conclusions about socioeconomic and geographic patterns in the City of New York.
Our first visualization is an elementary exploratory look at how rat sighting counts are distributed across boroughs. We created this graph to directly address our research question of how rat sightings differ by borough.
borough.counts <- as.data.frame(table(subset(rats, Borough != "Unspecified")$Borough))
names(borough.counts) = c("Borough","Count")
borough.counts <- rownames_to_column(borough.counts)
borough.counts <- borough.counts %>% filter(!row_number() %in% c(1))
borough.counts
## rowname Borough Count
## 1 2 BRONX 38652
## 2 3 BROOKLYN 74302
## 3 4 MANHATTAN 54608
## 4 5 QUEENS 30476
## 5 6 STATEN ISLAND 8579
ggplot(data = borough.counts, aes(x=Borough, y=Count)) +
geom_col(aes(fill=Borough)) +
labs(title="Number of Rat Sightings by Borough")
This bar chart displays the number of rat sightings reported in each
borough of New York City. Brooklyn had by far the most sightings at
74,302, followed by Manhattan, the Bronx, Queens, and Staten Island, in
that order. The low count in Staten Island might reflect its relative
isolation from the rest of the city, as well as its more suburban
character, which could plausibly explain why it suffers less from the
very urban problem of rats than the other boroughs. Similarly,
Brooklyn’s central position in the city may help explain why it had so
many sightings. Despite being the smallest borough by land area,
Manhattan had the second-most sightings, which may reflect its role as
one of the main business centers of the city (and the world): high
concentrations of people and food plausibly attract large numbers of
rats. Note that these are raw counts of reports, not adjusted for each
borough’s population or area. Thus, even this simple visualization of
rat sighting counts suggests broader differences between the boroughs.
ggplot() +
geom_polygon(data = nyczips, aes(x = long, y = lat, group = group, fill = n_rats)) +
theme_void() +
# scale_fill_gradient2(low = "darkblue", mid = "purple", high = "pink", midpoint=3000) +
coord_map() + labs(
title = "Rat Sightings by Zip Code",
fill = "# Rats"
)
private = c(
"1-2 Family Dwelling",
"1-2 Family Mixed Use Building",
"1-2 FamilyDwelling",
"1-3 Family Dwelling",
"1-3 Family Mixed Use Building",
"3+ Family Apartment Building",
"3+ Family Apt",
"3+ Family Apt.",
"3+ Family Apt. Building",
"3+ Family Mixed Use Building",
"3+Family Apt.",
"Apartment",
"Private House",
"Residence",
"Residential Building",
"Residential Property",
"Single Room Occupancy (SRO)"
)
commercial = c(
"Cafeteria - Public School",
"Catering Service",
"Commercial Building",
"Commercial Property",
"Construction Site",
"Day Care/Nursery",
"Government Building",
"Grocery Store",
"Hospital",
"Office Building",
"Restaurant",
"Restaurant/Bar/Deli/Bakery",
"Retail Store",
"School",
"School/Pre-School",
"Store",
"Street Fair Vendor",
"Summer Camp"
)
public = c(
"Abandoned Building",
"Beach",
"Building (Non-Residential)",
"Catch Basin/Sewer",
"Ground",
"Parking Lot/Garage",
"Public Garden",
"Public Stairs",
"Street Area",
"Vacant Building",
"Vacant Lot",
"Vacant Lot/Property"
)
other = c(
"",
"N/A",
"Other",
"Other (Explain Below)"
)
rats$Location.Type[rats$Location.Type %in% private] <- "Private"
rats$Location.Type[rats$Location.Type %in% public] <- "Public"
rats$Location.Type[rats$Location.Type %in% commercial] <- "Commercial"
rats$Location.Type[rats$Location.Type %in% other] <- "Other"
rats$Location.Type <- factor(rats$Location.Type)
rats$Date = as.Date(rats$Created.Date, "%m/%d/%Y")
rats_per_day = rats %>%
group_by(Date, Location.Type) %>%
tally()
names(rats_per_day) = c("date", "locationtype", "n_rats")
# ggplot(data=rats_per_day, aes(x=date, y=n_rats, color=locationtype)) +
# geom_line(alpha=0.3) + labs(
# title="Number of Rats Recorded on Each Day",
# subtitle="colored by the type of location",
# x="Date",
# y="Number of Rats Recorded"
# ) +
# scale_color_manual("Location Type",
# values = c("Other" = "yellow",
# "Commercial" = "blue",
# "Private" = "red",
# "Public" = "green"))
mayors$Date.Start <- as.Date(mayors$Date.Start)
mayors$Date.End <- as.Date(mayors$Date.End)
events$Date <- as.Date(events$Date)
ggplot(rats_per_day) +
geom_line(aes(date, n_rats, color=locationtype), alpha=0.3) + labs(
title="Number of Rats Recorded on Each Day (with Mayors)",
subtitle="colored by the type of location",
x="Date",
y="Number of Rats Recorded"
) +
scale_color_manual("Location Type",
values = c("Other" = "yellow",
"Commercial" = "blue",
"Private" = "orange",
"Public" = "green")) +
geom_rect(
data = mayors,
aes(xmin = Date.Start, xmax = Date.End, fill = Party),
ymin = -Inf, ymax = Inf, alpha = 0.1
) +
geom_vline(
aes(xintercept = as.numeric(Date.Start)),
data = mayors,
colour = "grey50", alpha = 0.5
) +
geom_text(
aes(x = Date.Start+60, y = 110, label = Name),
data = mayors,
size = 3, vjust = 0, hjust = 0, nudge_x = 50, angle = 90) +
geom_segment(data = events, aes(x = Date, y = 40, xend = Date, yend = 55), color = "red") +
geom_text(data = events, aes(x = Date-50, y = 95, label = Event.Name), angle=90, color = "red") +
scale_fill_manual(values = c("blue", "red"))
Building on this time series analysis, we now turn to a seasonal approach to modeling rat sightings over time, hoping to further address our research question of how temporal factors impact rat sighting counts.
library(ggplot2)
rats_per_day = rats %>%
group_by(Date) %>%
tally()
rats_per_day$Season = time2season(rats_per_day$Date, out.fmt = "seasons")
rats_per_day$Season = ifelse(rats_per_day$Season == "autumm", "autumn", rats_per_day$Season)
ggplot(data=rats_per_day, aes(x=Date, y=n, color = Season)) +
geom_line(alpha=0.3) + labs(
title="Number of Rats Recorded on Each Day",
subtitle="colored by the season",
x="Date",
y="Number of Rats Recorded"
) +
stat_rollapplyr(color = "red", width = 30, align = "left", alpha = 0.5) +
ggtitle("Width = 30")
## Warning: Removed 29 rows containing missing values (`geom_line()`).
The above graph plots the observed number of rat sightings per day, along with a 30-day moving average (the red line) that tracks the overall trend. We colored the daily counts by the season in which they were observed and found a periodic pattern: many sightings are consistently reported in the summer and few in the winter (except for one fateful day in 2017!). This could reflect a few things. Rats don’t like the cold and tend to stay inside, so they are less likely to be seen. However, humans don’t like the cold either, so they are less likely to go outside and observe rats in New York. Overall, it is interesting to note how rat observations change with each season.
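To make the moving average concrete, here is a minimal base R sketch of a trailing (right-aligned) rolling mean. It is not the plotting helper used above; the function name `trailing_mean` and the toy counts are hypothetical, and `stats::filter()` is swapped in purely for illustration.

```r
# Trailing moving average: each value is the mean of the current day
# and the (width - 1) days before it.
trailing_mean <- function(x, width) {
  weights <- rep(1 / width, width)
  # sides = 1 uses only current and past values; the first (width - 1)
  # positions have no full window and come back as NA.
  as.numeric(stats::filter(x, weights, sides = 1))
}

# Hypothetical daily counts, for illustration only.
daily_counts <- c(4, 6, 8, 10, 12)
smoothed <- trailing_mean(daily_counts, width = 3)
print(smoothed)
```

With a window of `width` days, the first `width - 1` days have no complete window and are returned as `NA`, which is consistent with the 29 rows dropped in the warning above for a 30-day window.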
ggplot() +
geom_polygon(data = nyczips, aes(x = long, y = lat, group = group, fill = rat_to_tax)) +
theme_void() +
scale_fill_gradient2(low = "#395184",
mid = "#A964B8",
high = "#FFA9A9", midpoint = 1500) +
coord_map() + labs(
title = "Rat to Tax Rating Ratio by Zip Code",
fill = "Number of Rats / Tax Rating (from 1 to 6)"
)
Having examined rat sightings over time, we now turn to the conditional distribution of rat sighting locations given borough, in order to assess how much the type of location where rats are reported differs between boroughs.
rats_other = subset(rats, !grepl("Family", rats$Location.Type, fixed = TRUE))
rats_other = subset(rats_other, Location.Type != "Other (Explain Below)")
rats_other = subset(rats_other, Location.Type != "Street Area")
rats_other = subset(rats_other, (Borough == "BRONX" | Borough == "BROOKLYN" | Borough == "MANHATTAN" | Borough == "QUEENS" | Borough == "STATEN ISLAND"))
rats_other = rats_other %>% group_by(Location.Type) %>% filter(n() > 300 )
rats_other["Location.Type"][rats_other["Location.Type"] == "Vacant Building" | rats_other["Location.Type"] == "Vacant Lot"] <- "Unoccupied"
rats_other["Location.Type"][rats_other["Location.Type"] == "Government Building" | rats_other["Location.Type"] == "Commercial Building"] <- "Office Building"
par(mar = c(5,4,1,10))
mosaicplot(table(rats_other$Borough, rats_other$Location.Type), main = "Mosaic Plot of Non-Family Location Types by Borough", shade=TRUE, las=2)
This mosaic plot visualizes the conditional distribution of rat reporting sites given borough. Many combinations of borough and reporting site occur significantly more or less often than we would expect if the two variables were independent. It is interesting to note how rat reporting sites reflect the distinctive landscape of each borough. For instance, we have significant evidence that Manhattan has higher proportions of rat sightings at office buildings and construction sites than would be expected under independence, which reflects Manhattan’s reputation as a bustling metropolis with many developed and developing commercial construction projects. It is also interesting to note that the proportion of reports made at Unoccupied sites (which we defined as reports made in either Vacant Buildings or Vacant Lots) is statistically significant for every single borough. The high proportion of such sightings in Staten Island, Brooklyn, and the Bronx may suggest the presence of pockets of high poverty or low economic development in these boroughs, while the significantly low proportion in the remaining boroughs, Manhattan and Queens, may suggest a relatively high degree of property and economic development there, where fewer spaces are left unused by homeowners or businesses. In all, this plot reveals that rats are generally found in very different sets of locations in different boroughs.
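The shading in a mosaic plot drawn with `shade = TRUE` is, as we understand it, based on Pearson residuals under independence, with default cut points at absolute values of 2 and 4. The sketch below computes those residuals by hand for a small hypothetical 2x2 table; the labels and counts are invented for illustration and are not taken from our data.

```r
# Hypothetical contingency table of sightings (rows: borough, columns: site type).
counts <- matrix(
  c(30, 10,
    10, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(Borough = c("A", "B"), Site = c("X", "Y"))
)

# Expected counts if borough and site type were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
pearson <- (counts - expected) / sqrt(expected)

# Cells whose residual exceeds 2 in absolute value would be shaded.
flagged <- abs(pearson) > 2

print(round(pearson, 2))
print(flagged)
```

A positive residual marks a combination that occurs more often than independence would predict, and a negative one marks a combination that occurs less often. The same values are available from `chisq.test(counts, correct = FALSE)$residuals`.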
# importing rats
#nice map (manhattan, brooklyn, queens, a bit of bronx)
left =-74.03
bottom = 40.64
right = -73.87
top = 40.85
nyc_coords <- c(left, bottom, right, top)
#full map (all boroughs)
leftF = -74.2
bottomF = 40.55
rightF = -73.87
topF = 40.85
nyc_coordsF <- c(leftF, bottomF, rightF, topF)
#just downtown manhattan
leftM = -74.03
bottomM = 40.69
rightM = -73.94
topM = 40.81
nyc_coordsM <- c(leftM, bottomM, rightM, topM)
nyc_map <- get_stamenmap(nyc_coords, maptype = "terrain", zoom = 11)
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
nyc_mapF <- get_stamenmap(nyc_coordsF, maptype = "terrain", zoom = 11)
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
nyc_mapM <- get_stamenmap(nyc_coordsM, maptype = "terrain", zoom = 11)
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
ratSubset <- subset(rats, Longitude<right & Latitude<top & Longitude > left & Latitude >bottom)
ratSubsetF <- subset(rats, Longitude<rightF & Latitude<topF & Longitude > leftF & Latitude >bottomF)
ratSubsetM <- subset(rats, Longitude<rightM & Latitude<topM & Longitude > leftM & Latitude >bottomM)
health_inspection[, c("SCORE")] <- sapply(health_inspection[, c("SCORE")], as.integer)
health_inspection["SCORE"][is.na(health_inspection["SCORE"])] <- 0
healthSubset <- subset(health_inspection, Longitude<right & Latitude<top & Longitude > left & Latitude >bottom & SCORE>50)
healthSubsetF <- subset(health_inspection, Longitude<rightF & Latitude<topF & Longitude > leftF & Latitude >bottomF & SCORE>50)
healthSubsetM <- subset(health_inspection, Longitude<rightM & Latitude<topM & Longitude > leftM & Latitude >bottomM & SCORE>50)
rat_map <- ggmap(nyc_map) +
geom_point(data=ratSubset, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")
rat_mapF <- ggmap(nyc_mapF) +
geom_point(data=ratSubsetF, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")
rat_mapM <- ggmap(nyc_mapM) +
geom_point(data=ratSubsetM, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")
# rat_map
# rat_mapF
# rat_mapM
# ratHealthScore_map <- ggmap(nyc_map) +
# geom_point(data=ratSubset, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "chocolate3")+
# geom_point(data=healthSubset, aes(x=Longitude, y = Latitude, color = SCORE), size = 0.5, alpha=0.2) +
# scale_color_distiller(palette = "PiYG")
#
# ratHealthScore_mapF <- ggmap(nyc_mapF) +
# geom_point(data=ratSubsetF, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "chocolate3")+
# geom_point(data=healthSubsetF, aes(x=Longitude, y = Latitude, color = SCORE), size = 0.5, alpha=0.2) +
# scale_color_distiller(palette = "PiYG")
#
# ratHealthScore_mapM <- ggmap(nyc_mapM) +
# geom_point(data=ratSubsetM, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.1, color = "chocolate3") +
# geom_point(data=healthSubsetM, aes(x=Longitude, y = Latitude, color = SCORE), size = 0.5, alpha=0.2) +
# scale_color_distiller(palette = "PiYG") +
# geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group), color="blue", fill="white", alpha=0.3)
#Restaurants (subset, score worse than 50)
ggmap(nyc_map) +
geom_density_2d_filled(data=ratSubset, aes(x = Longitude, y = Latitude, fill = after_stat(level)), alpha = 0.4) +
geom_point(data=healthSubset, aes(x=Longitude, y = Latitude, color = SCORE), size = 0.3, alpha=0.1) +
scale_color_distiller(palette = "YlOrRd")
### subway map
#SUBWAY STATIONS
ggmap(nyc_map) +
geom_density_2d_filled(data=ratSubset, aes(x = Longitude, y = Latitude, fill = after_stat(level)), alpha = 0.4) +
geom_point(data=subway_entrances, aes(x=longitude, y = latitude), color="red", size = 0.6, alpha=0.2) +
scale_color_distiller(palette = "PiYG")
## Warning: Removed 524 rows containing missing values (`geom_point()`).
Through this analysis, we have learned a multitude of interesting things about the conditional distribution of rat sightings in New York City given such variables as geography, temporal events, and physical landmarks. The distribution of sightings appears to be associated with many of the variables we examined, and it varies markedly over both space and time. The quantity of rat sightings differs greatly between boroughs, between zip codes within boroughs, and even between specific types of locations within different boroughs. Furthermore, rat sightings display a noticeable trend and seasonality over time, all the while responding to major events that occur in the city. Future analysis of this topic would do well to a) examine different datasets that could be compared to rat sighting distributions, such as racial or age-related data, in order to assess how people of different social groups experience varying levels of rats in their homes, and/or b) dive deeper into the auxiliary variables we have already selected; for example, correcting for geographic area in our borough and zip code data in order to calculate and visualize how rats per square mile (and, by extension, derived variables such as the ratio of rat sighting density to tax rating) change between geographic regions. In all, this project provided a thoughtful insight into life in New York City from the perspective of its most mainstay citizens - the rats.
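As a rough illustration of the area correction suggested above, the sketch below divides the borough sighting counts reported earlier by approximate land areas. The land areas are rounded figures assumed for illustration and should be checked against an authoritative source before being used in any real analysis.

```r
# Sighting counts per borough, as reported in the table earlier in this report.
sightings <- c(
  BRONX = 38652, BROOKLYN = 74302, MANHATTAN = 54608,
  QUEENS = 30476, "STATEN ISLAND" = 8579
)

# Approximate land areas in square miles (assumed, rounded values).
land_sq_mi <- c(
  BRONX = 42, BROOKLYN = 69, MANHATTAN = 23,
  QUEENS = 109, "STATEN ISLAND" = 58
)

# Sightings per square mile, matching areas to counts by borough name.
density <- sightings / land_sq_mi[names(sightings)]

print(round(sort(density, decreasing = TRUE)))
```

Under these assumed areas, Manhattan rather than Brooklyn would rank first, which shows how much the borough comparison can change once raw counts are normalized.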